A structural, content-similarity measure for detecting spam documents on the web

Authors

  • Maria Soledad Pera
  • Yiu-Kai Ng
Abstract

Purpose: The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents that are ranked among legitimate ones. The mixed results degrade the performance of search engines and frustrate users, who are required to filter out useless information. To improve the quality of Web searches, the number of spam documents on the Web must be reduced, if they cannot be eradicated entirely.

Design/methodology/approach: In this paper, we present a novel approach for identifying spam Web documents, which have mismatched titles and bodies and/or a low percentage of hidden content in their markup data structure.

Findings: By considering the content and markup of Web documents, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 84.5% of spam/legitimate Web documents, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed.

Research limitations/implications: Since the bigram-correlation values employed in our spam-detection approach are computed by using the unigram-correlation factors, this computation adds time to the spam-detection process and could generate a higher number of misclassified spam Web documents.

Originality/value: We have verified that our spam-detection approach outperforms existing anti-spam methods by at least 3% in terms of F-measure.
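The idea of flagging documents whose titles and bodies do not match, using precomputed word-correlation factors, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the actual similarity formula, the correlation values, or any threshold. The table `CORRELATION`, the functions `word_correlation`, `title_body_similarity`, and `looks_mismatched`, and the threshold of 0.5 are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List

# Hypothetical precomputed word-correlation factors (made-up values).
# Keys are unordered word pairs, so lookup is symmetric.
CORRELATION: Dict[FrozenSet[str], float] = {
    frozenset({"flight", "airline"}): 0.7,
    frozenset({"flight", "ticket"}): 0.6,
    frozenset({"cheap", "discount"}): 0.8,
}


def word_correlation(a: str, b: str) -> float:
    """Look up the correlation of two words; identical words score 1.0."""
    if a == b:
        return 1.0
    return CORRELATION.get(frozenset({a, b}), 0.0)


def title_body_similarity(title: List[str], body: List[str]) -> float:
    """Average, over title words, of the best correlation to any body word."""
    if not title or not body:
        return 0.0
    best = [max(word_correlation(t, w) for w in body) for t in title]
    return sum(best) / len(best)


def looks_mismatched(title: List[str], body: List[str],
                     threshold: float = 0.5) -> bool:
    """Flag a document whose title is poorly supported by its body.

    The threshold is an assumption, not a value from the paper.
    """
    return title_body_similarity(title, body) < threshold
```

Because the correlation factors are looked up rather than computed at detection time, scoring a document costs only table lookups, which is in the spirit of the "computationally inexpensive" claim in the Findings. A real detector would also need the markup analysis that the abstract mentions, which this sketch omits.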

Similar resources

Unsupervised Spam Detection by Document Complexity Estimation

In this paper, we study content-based spam detection for a specific type of spam, namely blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using a compression-based similarity measure called the document complexity. Using suffix trees, the algorithm computes the document complexity for ...

Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity

Due to the increasing number of XML documents, organizing these documents effectively in order to retrieve useful information from them is essential. A possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...

A Novel Image Structural Similarity Index Considering Image Content Detectability Using Maximally Stable Extremal Region Descriptor

Image content detectability and image structure preservation are closely related concepts with an undeniable role in image quality assessment. However, most image quality studies have concentrated on image structure evaluation, and few have focused on image content detectability. Examination of image structure was first introduced and assessed in the Structural SIMilarity (SSIM) measu...

A novel method for detecting structural damage based on data-driven and similarity-based techniques under environmental and operational changes

The application of time series modeling and statistical similarity methods to structural health monitoring (SHM) provides promising and capable approaches to structural damage detection. The main aim of this article is to propose an efficient univariate similarity method named Kullback similarity (KS) for identifying the location of damage and estimating the level of damage severity. An impr...

An Effective Model for SMS Spam Detection Using Content-based Features and Averaged Neural Network

In recent years, there has been considerable interest in using the short message service (SMS) as one of the essential and straightforward communication services on mobile devices. The increased popularity of this service has also increased the number of attacks on mobile devices, such as SMS spam messages. SMS spam messages constitute a real problem for mobile subscribers; this worries telecomm...

Journal title:
  • IJWIS

Volume 5  Issue 

Pages  -

Publication date 2009